Lesson: Using Text Data Processing 


Each supported language has its own built-in dictionaries that recognize certain types of 
extractions, such as SAP being a company name. Some entities are extracted as PROP_MISC, 
meaning a proper miscellaneous name. This indicates that the system knows the entry is 
meaningful but cannot determine the type of entity. You can improve the accuracy of results 
by disambiguating the type of these entities in a dictionary. 


• For example, processing the text "The World Cup was hosted by South Africa in 
2010" would not identify World Cup as a sporting event using any of the default 
dictionaries. 


• Add World Cup and any desired variations to a dictionary as type SPORTING_EVENT to 
resolve these entries during extraction. 
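
The effect of such a dictionary entry can be sketched in a few lines of Python. This is only a toy model of the idea, not the Data Services dictionary format or API: the names CUSTOM_DICTIONARY and resolve_entities are hypothetical, and the types assigned to South Africa and 2010 are assumed for illustration.

```python
from typing import Dict, List, Tuple

# An extracted entity is modeled as (text, entity_type).
Entity = Tuple[str, str]

# Hypothetical custom dictionary: lowercased variant -> (standard form, entity type).
CUSTOM_DICTIONARY: Dict[str, Tuple[str, str]] = {
    "world cup": ("World Cup", "SPORTING_EVENT"),
    "fifa world cup": ("World Cup", "SPORTING_EVENT"),
}


def resolve_entities(
    entities: List[Entity], dictionary: Dict[str, Tuple[str, str]]
) -> List[Entity]:
    """Replace PROP_MISC entities with the type listed in the dictionary."""
    resolved: List[Entity] = []
    for text, entity_type in entities:
        entry = dictionary.get(text.lower())
        if entity_type == "PROP_MISC" and entry is not None:
            # The dictionary knows this name: use its standard form and type.
            resolved.append(entry)
        else:
            # No dictionary entry, or the type was already determined.
            resolved.append((text, entity_type))
    return resolved


# Assumed extraction result for the example sentence, before the dictionary is applied.
extracted = [
    ("World Cup", "PROP_MISC"),
    ("South Africa", "COUNTRY"),
    ("2010", "YEAR"),
]
resolved = resolve_entities(extracted, CUSTOM_DICTIONARY)
```

In this sketch only the PROP_MISC entry is changed; entities whose type was already determined pass through untouched.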


Entity Extraction Transform 

The Entity Extraction transform is located within the Text Data Processing category of the 
Local and Central Object Libraries in Data Services Designer. The category contains an 
Entity_Extraction transform, which provides a single transform configuration, 
Base_EntityExtraction, that can be used for extraction. 


The Text Data Processing Entity Extraction transform supports processing text, HTML, or 
XML content held in varchar, LONG, or BLOB data types. The content can come from 
multiple sources, such as a flat file, a column of an Excel spreadsheet, or a column of a 
database table. 


The Entity Extraction transform has one mandatory field: Language. The transform can now 
identify the language of the input text in any of the more than 30 supported languages. 


Filtering Output by Entity Types for Selected Languages 


By default, extraction outputs all entity types available in a selected language. This can often 
lead to excess noise. 


Filtering lets you limit extraction output to entities of the type you select from the filtering 
dialog. 


• Select a language and select the Filter By Entity Types option. 
• Select the ... button. 
• Select one or more entity types and save your changes. 
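
Conceptually, the filter keeps only the entities whose type is in the selected set. The following Python sketch illustrates that behavior on a hand-written extraction result. It is not the transform itself: filter_by_entity_types is a hypothetical helper, and the sample entities and type names are assumed for illustration.

```python
from typing import List, Set, Tuple

# An extracted entity is modeled as (text, entity_type).
Entity = Tuple[str, str]


def filter_by_entity_types(entities: List[Entity], selected_types: Set[str]) -> List[Entity]:
    """Keep only entities whose type was selected; preserve the original order."""
    return [entity for entity in entities if entity[1] in selected_types]


# Assumed unfiltered output: every entity type found in the text.
all_entities = [
    ("SAP", "ORGANIZATION/COMMERCIAL"),
    ("South Africa", "COUNTRY"),
    ("2010", "YEAR"),
    ("World Cup", "PROP_MISC"),
]

# Types chosen in the filtering dialog (illustrative selection).
selected = {"ORGANIZATION/COMMERCIAL", "COUNTRY"}
filtered = filter_by_entity_types(all_entities, selected)
```

With this selection, the year and the PROP_MISC entry are dropped from the output, which is the noise reduction that the filtering option is meant to provide.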


Advanced Parsing enriches noun phrase extractions with pronouns, numbers, and 
determiners that can be used when writing custom extraction rules. Advanced Parsing is only 
supported for the English language. Enabling advanced parsing does not enable co-reference 
resolution, that is, the ability to relate pronouns to named entities. 


Enabling Advanced Parsing 





Advanced Parsing, available for the English language only, enriches linguistic processing 
by including richer noun structure, noun phrase coordination, and syntactic function 
attributes that can be leveraged in custom rules. 


The English language supports two types of Noun Phrase extraction: one from standard 
parsing and one from advanced parsing. The advanced parser returns the noun phrase with 
related pronouns, numbers, and determiners. It also supports noun phrase coordination. 


The following list shows the standard phrase with advanced phrases in bold. 